True, if the response variable has correlation 0 with all the predictor variables, then the fitted model reduces to the intercept alone, with all slopes equal to 0. In the simple linear regression case, this would be a horizontal line through the mean of the data.
True, even if predictor variables are perfectly correlated, the model can still be a good fit for the data. Thinking geometrically, under perfect multicollinearity the column space of \(X\) has dimension less than \(p\). However, the fitted values are still the projection of \(Y\) onto that (smaller) space, so a good fit remains possible.
True, since \(\hat{Y} = \hat{Y}^*\), the sums of squares remain the same, so there is no effect on the coefficient of multiple determination.
True, since the \(p-1\) individual t-tests are not equivalent to testing whether there is a regression relation between \(Y\) and the set of \(X\) variables (which the F test does). Under multicollinearity the two sets of tests can disagree.
True, say for example we have a variable \(X_1\) and add \(X_2\), which is perfectly correlated with \(X_1\). The SSR remains unchanged, but the MSR decreases because the numerator degrees of freedom increase. This could lead to individually significant t-tests for each coefficient but an insignificant overall F test.
True, since both the error variance and the variance of the corresponding least squares estimator would be \(\sigma^2\).
True, \(\hat{\beta}^*_k\) is equal to \(\frac{s_y}{s_{x_k}} r_{yx_k}\), which is not influenced by the correlations among the other X variables. It measures how much of the variation in \(Y\) can be explained by \(X_k\).
True, since the inflated variance of \(\hat{\beta}^*_k\) is caused by the intercorrelation between \(X_k\) and the rest of the \(X\) variables.
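The claims above about perfectly correlated predictors can be checked with a small simulation (illustrative only; the variables and data here are made up): adding a predictor that is an exact linear function of an existing one changes neither the residuals nor the sums of squares.

```r
set.seed(1)
x1 <- rnorm(50)
x2 <- 2 * x1                     # perfectly correlated with x1
y  <- 3 + 1.5 * x1 + rnorm(50)

fit_one  <- lm(y ~ x1)
fit_both <- lm(y ~ x1 + x2)      # rank-deficient: x2's coefficient is NA

# The redundant predictor leaves the SSE (hence SSR and R^2) unchanged
sum(fit_one$residuals^2)
sum(fit_both$residuals^2)
coef(fit_both)
```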
library(ggplot2)
library(GGally)
library(plotly)
property <- read.table("property.txt")
colnames(property) <-
c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
property
model1 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
model1
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
##
## Coefficients:
## (Intercept) Age Exp Sq.Foot Vac.Rate
## 1.220e+01 -1.420e-01 2.820e-01 7.924e-06 6.193e-01
summary(model1)
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1872 -0.5911 -0.0910 0.5579 2.9441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.220e+01 5.780e-01 21.110 < 2e-16 ***
## Age -1.420e-01 2.134e-02 -6.655 3.89e-09 ***
## Exp 2.820e-01 6.317e-02 4.464 2.75e-05 ***
## Sq.Foot 7.924e-06 1.385e-06 5.722 1.98e-07 ***
## Vac.Rate 6.193e-01 1.087e+00 0.570 0.57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.137 on 76 degrees of freedom
## Multiple R-squared: 0.5847, Adjusted R-squared: 0.5629
## F-statistic: 26.76 on 4 and 76 DF, p-value: 7.272e-14
anova(model1)
# We find the coefficient of partial determination
R_3 <- 0.420 / (98.231 + 0.420)
R_3
## [1] 0.004257433
# partial correlation
r_3 <- sqrt(R_3)
r_3
## [1] 0.06524901
Here \(R^2_{Y3|124}\) is the proportional reduction in the SSE when we add the variable \(X_3\) to the model containing \(X_1\), \(X_2\) and \(X_4\), which here is only about 0.4%.
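The same quantity can be computed directly from the two fitted models instead of from the printed ANOVA values (a sketch, assuming `property` is loaded as above):

```r
# R^2_{Y3|124} = (SSE(X1,X2,X4) - SSE(X1,X2,X3,X4)) / SSE(X1,X2,X4)
red  <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
full <- lm(Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
(sum(red$residuals^2) - sum(full$residuals^2)) / sum(red$residuals^2)
```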
addX3 <- lm(Vac.Rate ~ Age + Exp + Sq.Foot, data = property)
remX3 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
# Regressing the Y residuals on the X3 residuals
fit1 <- lm(remX3$residuals ~ addX3$residuals)
ggplot() + aes(x = addX3$residuals, y = remX3$residuals) + geom_point() +
  labs(x = "e(X3|X1,X2,X4)", y = "e(Y|X1,X2,X4)",
       title = "Added Variable Plot for X3") +
  geom_abline(intercept = fit1$coefficients[1], slope = fit1$coefficients[2])
The added variable plot implies that \(X_3\) provides very little additional help in explaining \(Y\) when \(X_1\), \(X_2\) and \(X_4\) are already in the model.
model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
fit2 <- lm(model2$residuals ~ addX3$residuals)
summary(fit2)
##
## Call:
## lm(formula = model2$residuals ~ addX3$residuals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1872 -0.5911 -0.0910 0.5579 2.9441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.039e-16 1.239e-01 0.000 1.000
## addX3$residuals 6.193e-01 1.066e+00 0.581 0.563
##
## Residual standard error: 1.115 on 79 degrees of freedom
## Multiple R-squared: 0.004255, Adjusted R-squared: -0.008349
## F-statistic: 0.3376 on 1 and 79 DF, p-value: 0.5629
We can see that the fitted regression slope is the same value as the regression coefficient of \(X_3\) from part b.
# From part d
anova(fit2)
anova(model1)
We can see that \(SSR(\text{residuals}) = SSR(X_3 |X_1, X_2, X_4)\). We also see that both models have exactly the same SSE.
# We find the correlation between the residuals of the models
r <- cor(model2$residuals, addX3$residuals)
r
## [1] 0.06522951
r^2
## [1] 0.004254889
We can see that the correlation coefficient \(r\) for the residuals is the same as the partial correlation coefficient \(r_{Y3|124}\), with \(r^2 = 0.00425\).
Yresid <- lm(property$Ren.Rate ~ addX3$residuals)
summary(Yresid)
##
## Call:
## lm(formula = property$Ren.Rate ~ addX3$residuals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7641 -1.1392 -0.1056 1.1221 4.1630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.1389 0.1921 78.807 <2e-16 ***
## addX3$residuals 0.6193 1.6528 0.375 0.709
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.729 on 79 degrees of freedom
## Multiple R-squared: 0.001774, Adjusted R-squared: -0.01086
## F-statistic: 0.1404 on 1 and 79 DF, p-value: 0.7089
We can see that the fitted regression coefficient from regressing \(Y\) on the residuals is the same as the fitted regression coefficient of \(X_3\) from part b. This is because the residuals \(e(X_3|X_1, X_2, X_4)\) capture exactly the part of \(X_3\) not already explained by \(X_1\), \(X_2\) and \(X_4\).
# Let's reset the data to its original form
property <- read.table("property.txt")
colnames(property) <-
c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
property
#Means of each column
colMeans(property)
## Ren.Rate Age Exp Vac.Rate Sq.Foot
## 1.513889e+01 7.864198e+00 9.688148e+00 8.098765e-02 1.606333e+05
# SD, we use sapply function
sapply(property, sd)
## Ren.Rate Age Exp Vac.Rate Sq.Foot
## 1.719584e+00 6.632784e+00 2.583169e+00 1.345512e-01 1.090990e+05
# Standardize the data
property[,2:5] <- scale(property[,2:5])
#Means of each column standardized
round(colMeans(property),4)
## Ren.Rate Age Exp Vac.Rate Sq.Foot
## 15.1389 0.0000 0.0000 0.0000 0.0000
# SDs of each column, standardized
sapply(property, sd)
## Ren.Rate Age Exp Vac.Rate Sq.Foot
## 1.719584 1.000000 1.000000 1.000000 1.000000
modelSTD <- lm(Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
modelSTD
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
##
## Coefficients:
## (Intercept) Age Exp Sq.Foot Vac.Rate
## 15.13889 -0.94208 0.72850 0.86453 0.08333
\[ \hat{Y}_i = 15.14 - 0.94X^*_{i1} + 0.73X^*_{i2} + 0.08X^*_{i3} + 0.86X^*_{i4} \] where \(X^*_{ik}\) denotes the standardized predictors.
The fitted regression intercept for the standardized model is 15.13889.
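As a sanity check (a sketch, assuming `model1` from part b is still in the workspace): since only the predictors were scaled, each standardized slope should equal the corresponding original-scale slope multiplied by that predictor's standard deviation.

```r
# SDs of Age, Exp, Sq.Foot, Vac.Rate, in the order they enter the formula
sds <- c(6.632784, 2.583169, 1.090990e+05, 1.345512e-01)
coef(model1)[-1] * sds   # should match coef(modelSTD)[-1]
```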
# ANOVA from standardized model
anova(modelSTD)
# ANOVA from model1
anova(model1)
The ANOVA from the standardized model and the original Model 1 have the exact same ANOVA table, with \(SSE = 98.231\), \(SSR = 138.328\) and \(SSTO = 236.559\).
summary(modelSTD)
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1872 -0.5911 -0.0910 0.5579 2.9441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.13889 0.12632 119.845 < 2e-16 ***
## Age -0.94208 0.14156 -6.655 3.89e-09 ***
## Exp 0.72850 0.16318 4.464 2.75e-05 ***
## Sq.Foot 0.86453 0.15108 5.722 1.98e-07 ***
## Vac.Rate 0.08333 0.14623 0.570 0.57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.137 on 76 degrees of freedom
## Multiple R-squared: 0.5847, Adjusted R-squared: 0.5629
## F-statistic: 26.76 on 4 and 76 DF, p-value: 7.272e-14
The \(R^2\) and \(R^2_a\) are the same in the standardized model as they are in the original model, \(0.5847\) and \(0.5629\) respectively.
rxx1 <- solve(cor(property[,2:5]))
rxx1
## Age Exp Vac.Rate Sq.Foot
## Age 1.2403482 -0.2870567 0.2244927 -0.2495354
## Exp -0.2870567 1.6482246 0.6092380 -0.6926391
## Vac.Rate 0.2244927 0.6092380 1.3235525 -0.4399669
## Sq.Foot -0.2495354 -0.6926391 -0.4399669 1.4127219
# Variance inflation factors are the diagonal values
diag(rxx1)
## Age Exp Vac.Rate Sq.Foot
## 1.240348 1.648225 1.323552 1.412722
sumX1 <- summary(lm(Age ~ Exp + Vac.Rate + Sq.Foot, data = property))
1 / (1 - sumX1$r.squared)
## [1] 1.240348
sumX2 <- summary(lm(Exp ~ Age + Vac.Rate + Sq.Foot, data = property))
1 / (1 - sumX2$r.squared)
## [1] 1.648225
sumX3 <- summary(lm(Vac.Rate ~ Exp + Age + Sq.Foot, data = property))
1 / (1 - sumX3$r.squared)
## [1] 1.323552
sumX4 <- summary(lm(Sq.Foot ~ Vac.Rate+ Exp + Age, data = property))
1 / (1 - sumX4$r.squared)
## [1] 1.412722
We can see that the degree of multicollinearity in this data is nothing to be concerned about: there is no high correlation between any pair of variables, and no variance inflation factor exceeds 10 (the largest is about 1.65).
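As a cross-check, the `vif()` function from the `car` package (assuming the package is installed) computes the same variance inflation factors directly from the fitted model:

```r
library(car)
vif(modelSTD)   # same values as diag(rxx1)
```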
YX4model <- lm(Ren.Rate ~ Sq.Foot, data = property)
YX3X4model <- lm(Ren.Rate ~ Vac.Rate + Sq.Foot, data = property)
summary(YX4model)
##
## Call:
## lm(formula = Ren.Rate ~ Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1390 -0.7930 0.2890 0.9653 3.4415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.1389 0.1624 93.215 < 2e-16 ***
## Sq.Foot 0.9204 0.1634 5.632 2.63e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.462 on 79 degrees of freedom
## Multiple R-squared: 0.2865, Adjusted R-squared: 0.2775
## F-statistic: 31.72 on 1 and 79 DF, p-value: 2.628e-07
summary(YX3X4model)
##
## Call:
## lm(formula = Ren.Rate ~ Vac.Rate + Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1886 -0.7879 0.3140 0.9820 3.4021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.13889 0.16338 92.659 < 2e-16 ***
## Vac.Rate 0.04046 0.16494 0.245 0.807
## Sq.Foot 0.91717 0.16494 5.561 3.63e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.47 on 78 degrees of freedom
## Multiple R-squared: 0.2871, Adjusted R-squared: 0.2688
## F-statistic: 15.7 on 2 and 78 DF, p-value: 1.859e-06
We find that the regression coefficients for \(X_4\) in the two models are very close to each other, going from \(0.9204\) to \(0.91717\).
anova(YX4model)
anova(YX3X4model)
The \(SSR(X_4) = 67.775\) and the \(SSR(X_4|X_3) = 66.858\). We find that \(SSR(X_4)\) is slightly larger than \(SSR(X_4|X_3)\). This small decrease occurs because \(X_3\) is slightly correlated with \(X_4\), so a small part of the variation attributed to \(X_4\) alone is already captured once \(X_3\) is in the model.
YX2model <- lm(Ren.Rate ~ Exp, data = property)
YX1X2model <- lm(Ren.Rate ~ Age + Exp, data = property)
summary(YX2model)
##
## Call:
## lm(formula = Ren.Rate ~ Exp, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5733 -0.9093 -0.1559 0.8290 4.7607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.1389 0.1750 86.49 < 2e-16 ***
## Exp 0.7115 0.1761 4.04 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.575 on 79 degrees of freedom
## Multiple R-squared: 0.1712, Adjusted R-squared: 0.1607
## F-statistic: 16.32 on 1 and 79 DF, p-value: 0.0001231
summary(YX1X2model)
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6473 -0.9041 -0.1574 0.4652 4.0687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.1389 0.1535 98.601 < 2e-16 ***
## Age -0.8330 0.1677 -4.967 3.92e-06 ***
## Exp 1.0354 0.1677 6.175 2.79e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.382 on 78 degrees of freedom
## Multiple R-squared: 0.3704, Adjusted R-squared: 0.3543
## F-statistic: 22.94 on 2 and 78 DF, p-value: 1.457e-08
We can see that the estimated regression coefficient for \(X_2\) increases from \(0.7115\) to \(1.0354\).
anova(YX2model)
anova(YX1X2model)
The \(SSR(X_2) = 40.503\) and the \(SSR(X_2|X_1) = 72.802\). We find that \(SSR(X_2)\) is much smaller than \(SSR(X_2|X_1)\). This increase occurs because including \(X_1\) reduces the SSE, so \(X_2\) explains much more of the remaining variation once \(X_1\) is in the model.
ggplotly(ggplot(data = property, aes(x = Age, y = Ren.Rate)) + geom_point())
We can see from the plot that there is no clear sign of a linear relationship between the age of a property and its rental rate.
We have model equation: \[ Y_i = \beta_0 + \beta_1\tilde{X}_{i1} + \beta_{11}\tilde{X}_{i1}^2 + \beta_2X_{i2} + \beta_4X_{i4} + \varepsilon_i \] where \(\tilde{X}_{i1} = X_{i1} - \bar{X}_1\) is the centered age variable.
property["AgeCent"] <- property$Age - mean(property$Age)
property["AgeSq"] <- property$AgeCent ^ 2
polyModel <-
lm(Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
summary(polyModel)
##
## Call:
## lm(formula = Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89596 -0.62547 -0.08907 0.62793 2.68309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.5242 0.2808 51.729 < 2e-16 ***
## AgeCent -1.2057 0.1692 -7.125 5.10e-10 ***
## AgeSq 0.6224 0.2561 2.431 0.0174 *
## Exp 0.8112 0.1519 5.340 9.33e-07 ***
## Sq.Foot 0.8778 0.1382 6.351 1.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.097 on 76 degrees of freedom
## Multiple R-squared: 0.6131, Adjusted R-squared: 0.5927
## F-statistic: 30.1 on 4 and 76 DF, p-value: 5.203e-15
#Plotting Observations Against Fitted Values
ggplotly(
  ggplot() +
    aes(x = polyModel$fitted.values, y = property$Ren.Rate) +
    geom_point() +
    labs(x = "Fitted Values", y = "Observations",
         title = "Observations against Fitted Values")
)
We have the fitted regression function: \[ \hat{Y}_i = 14.52 - 1.21\tilde{X}_{i1} + 0.62\tilde{X}_{i1}^2 + 0.81X_{i2} + 0.88X_{i4} \] (with the predictors on the standardized scale, as fitted above). We find that our model is a good fit: we have a relatively good \(R^2_{adj}\) as well as a fairly linear Observations against Fitted Values plot.
# Model 2
model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
summary(model2)
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0620 -0.6437 -0.1013 0.5672 2.9583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.1389 0.1258 120.374 < 2e-16 ***
## Age -0.9562 0.1388 -6.891 1.33e-09 ***
## Exp 0.6901 0.1480 4.663 1.29e-05 ***
## Sq.Foot 0.8922 0.1424 6.265 1.97e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.132 on 77 degrees of freedom
## Multiple R-squared: 0.583, Adjusted R-squared: 0.5667
## F-statistic: 35.88 on 3 and 77 DF, p-value: 1.295e-14
We find that both the \(R^2\) and \(R^2_{adj}\) are higher in the quadratic model than the Model 2. The \(R^2\) for Model 2 is \(0.583\) and \(0.6131\) for the quadratic model. The \(R^2_{adj}\) for Model 2 is \(0.5667\) and \(0.5927\) for the quadratic model. This would lead us to conclude that the quadratic model is a better fit than Model 2.
To test our full model versus our reduced model, we have: \[ H_0: \beta_j = 0\ \text{for all} \ j\in \mathbf J \qquad H_a: \text{not all} \ \beta_j \ (j\in \mathbf J) \ \text{equal } 0 \] With test statistic and null distribution: \[ F^* = \frac{\frac{SSE(R)-SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}, \qquad F^* \sim F_{(df_R - df_F,\, df_F)} \ \text{under } H_0 \]
We reject \(H_0\) if \(F^* > F_{(1- \alpha;\, df_R - df_F,\, df_F)}\).
# Find critical value F(0.95; 1, 76); the denominator df is df_F = 76
qf(1 - 0.05, 77 - 76, 76)
anova(polyModel, model2)
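The test statistic can also be computed by hand from the two SSEs (a sketch, assuming `polyModel` and `model2` are fitted as above; it should agree with the F value reported by `anova()`):

```r
sseF <- sum(polyModel$residuals^2)   # full model, df_F = 76
sseR <- sum(model2$residuals^2)      # reduced model, df_R = 77
Fstar <- ((sseR - sseF) / (77 - 76)) / (sseF / 76)
Fstar
```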
Given that our \(F^*\) value of \(5.9078\) exceeds the critical value, we reject \(H_0\) and conclude that the quadratic term is significant in the model at \(\alpha = 0.05\).
#Our prediction for model 2
predict(model2, data.frame(Age = 4, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
## fit lwr upr
## 1 71396.95 41306.87 101487
# Our prediction for quadratic model
predict(polyModel, data.frame(AgeCent = 4, AgeSq = 16, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
## fit lwr upr
## 1 70251.52 41040.56 99462.49
We can see that when we predict with the quadratic model, we get a lower fitted value and prediction interval than with Model 2.
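One caveat: because `property` was overwritten with scaled predictors earlier, `predict()` above treats the raw values Age = 4, Exp = 10 and Sq.Foot = 80000 as if they were already on the standardized scale, which is why the fitted values lie far outside the observed range of rental rates. A sketch of the intended computation (assuming the column means and standard deviations printed earlier) first converts the new observation to standardized units:

```r
# Standardize the new observation with the training means and SDs
newdata <- data.frame(
  Age     = (4     - 7.864198)    / 6.632784,
  Exp     = (10    - 9.688148)    / 2.583169,
  Sq.Foot = (80000 - 1.606333e+05) / 1.090990e+05
)
predict(model2, newdata, interval = "prediction", level = 0.99)
```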